ABSTRACT

Analysis of aggregate Web traffic has shown that PageRank 
is a poor model of how people actually navigate the Web. 
Using the empirical traffic patterns generated by a thou- 
sand users over the course of two months, we characterize 
the properties of Web traffic that cannot be reproduced by 
Markovian models, in which destinations are independent 
of past decisions. In particular, we show that the diver- 
sity of sites visited by individual users is smaller and more 
broadly distributed than predicted by the PageRank model; 
that link traffic is more broadly distributed than predicted; 
and that the time between consecutive visits to the same 
site by a user is less broadly distributed than predicted. To 
account for these discrepancies, we introduce a more real- 
istic navigation model in which agents maintain individual 
lists of bookmarks that are used as teleportation targets. 
The model can also account for branching, a traffic property 
caused by browser features such as tabs and the back but- 
ton. The model reproduces aggregate traffic patterns such as 
site popularity, while also generating more accurate predic- 
tions of diversity, link traffic, and return time distributions. 
This model for the first time allows us to capture the ex- 
treme heterogeneity of aggregate traffic measurements while 
explaining the more narrowly focused browsing patterns of 
individual users. 
1. INTRODUCTION

Despite its simplicity, PageRank [2] has been a remarkably 
robust model of human Web browsing as a random surf- 
ing activity. Such models of Web surfing allow us to study 
how people interact with the Web. As more people spend a 
large portion of their time online, their Web traces provide 
an increasingly informative window into human dynamics. 
Models of user browsing also have important practical ap- 
plications, ranging from ranking search results [5] to guiding 
crawlers [?] to predicting advertising revenues [6].

The availability of large volumes of Web traffic data is
having two important consequences. On one hand, it motivates
the integration of popularity measurements into search
ranking algorithms [10]. On the other hand, it enables
systematic testing of PageRank's underlying navigation
assumptions [12]. Traffic patterns aggregated across users
have revealed that some key assumptions (uniform random walk
and uniform random teleportation) are widely violated,
making PageRank a poor predictor of traffic. Such results leave
open the question of how to design a better Web navigation
model. Here we expand on our previous empirical analysis [11]
by considering also individual traffic patterns [8].
Our results provide further evidence for the limits of Marko- 
vian models such as PageRank. They suggest the need for 
an agent-based model that carries state information and can 
account for both individual and aggregate traffic patterns 
observed in real-world data. 

We conducted a field study that allowed us to collect in- 
dividual Web traffic data from over a thousand users on the 
main campus of Indiana University. Analysis of this data
leads to several contributions, summarized as follows: 

• We find that the traffic through hyperlinks is far more
broadly distributed than predicted by the PageRank
model.

• We show that the empirical diversity of sites visited by 
individual users, as measured by Shannon entropy, is 
both smaller and more broadly distributed than pre- 
dicted by PageRank. 

• We find that the distribution of times between consec- 
utive visits to the same site by a user is narrower than 
expected. Together with the previous item, this shows
that a typical user has both focused interests and
recurrent habits. We argue that the diversity apparent
in many aggregate measures of traffic [13, 12] is a
consequence of this diversity of individual interests rather
than the behavior of extremely eclectic users who visit
a wide variety of Web sites.

• We introduce a novel agent-based navigation model, 
BookRank, in which bookmarks are managed and used 
as teleportation targets, and tabbed browsing is a nat- 
ural feature of the exploration process. The model 
reconciles individual browsing behavior with aggregate 
Web navigation patterns. One key individual behav- 
ior is revealed in a rank-based choice among previously 
visited sites as restart pages for surfing. Surprisingly, 
the mechanism behind this choice matches quantita- 
tively the one reported for the selection among search 
engine results [5, 7].

• Finally, we demonstrate that the novel ingredients of 
BookRank allow it to improve considerably on PageR- 
ank's predictions for several empirical observations of 
both aggregate and individual traffic. 

2. BOOKRANK MODEL 

The introduction of PageRank [2] and the subsequent
development of Google marked a turning point in the history
of the Web. For the first time, search results were ranked us- 
ing a model of the way people navigate through Web pages. 
Other models have been proposed over the years [9, 14, 1],
but limited availability of empirical data on Web naviga- 
tion has prevented a systematic test of their strengths and 
weaknesses. An alleged limitation of PageRank and similar 
models lies in the lack of user memory. An agent jumps from 
one page to another without any record of where it has been 
before or intends to go. Because they cannot purposefully 
return to a page, random surfers do not form navigational 
habits. Empirical measures of navigation patterns tell us 
that this assumption does not correspond to real user
behavior [?].

BookRank, the model we propose here, provides random 
surfers with memory through the paradigm of bookmarks: 
each agent maintains a list of pages ranked by the number of 
previous visits. According to the BookRank model, agents in 
a population navigate the Web (in parallel or sequentially) 
according to the following algorithm. 

Initially, each agent randomly selects a starting site (node). 
Then, for each time step: 

1. Unless previously visited, the current node is added to
the bookmark list. The frequency of visits is recorded,
and the list of bookmarks is kept ranked from most to
least visited.

2. With probability $p_t$, the agent teleports (jumps) to
a previously visited site (bookmark). The bookmark
with rank $R$ is chosen with probability
$P(R) \propto R^{-\beta}$.

3. Otherwise, with probability $1 - p_t$, the agent navigates
locally, following a link from the present node. There
are two alternatives:

(a) With probability $p_b$, the back button is used,
leading back to the previous site.

(b) Otherwise, with probability $1 - p_b$, an outgoing
link is clicked with uniform probability.

Figure 1: Distributions of all traffic to sites, $T$ (top),
and traffic originating only from jumps to sites, $T_0$
(bottom), for the empirical click stream, PageRank,
and BookRank.



Note that PageRank is a special case of BookRank for
$\beta = p_b = 0$.
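
The navigation algorithm described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used for the results reported here. The function name `bookrank_walk`, the adjacency-dictionary representation of the graph, and the default value of `p_b` are hypothetical. The text does not specify what happens when the back button is unavailable or a page has no outgoing links, so the fallbacks below (follow a link, go back, or jump to a bookmark) are assumptions as well.

```python
import random
from collections import Counter


def bookrank_walk(graph, n_clicks, p_t=0.15, p_b=0.2, beta=1.4, seed=0):
    """Simulate a single agent; `graph` maps each node to a list of out-links.

    Returns the sequence of visited nodes and the visit counts (bookmarks).
    """
    rng = random.Random(seed)
    current = rng.choice(list(graph))  # random starting site
    previous = None                    # page the back button would return to
    visits = Counter()                 # bookmark list: node -> number of visits
    path = []
    for _ in range(n_clicks):
        # Step 1: record the visit; ranking by visit count happens on demand.
        visits[current] += 1
        path.append(current)
        out_links = graph.get(current, [])
        can_go_back = previous is not None
        # Assumption: an agent stuck on a dead end with no history must jump.
        if rng.random() < p_t or (not out_links and not can_go_back):
            # Step 2: teleport to the bookmark of rank R with P(R) ~ R^-beta.
            ranked = [node for node, _ in visits.most_common()]
            weights = [(rank + 1) ** -beta for rank in range(len(ranked))]
            nxt = rng.choices(ranked, weights=weights)[0]
            previous = None  # assumption: a jump clears the back history
        elif can_go_back and (rng.random() < p_b or not out_links):
            # Step 3(a): back button (also used when there is no out-link).
            nxt, previous = previous, current
        else:
            # Step 3(b): follow an outgoing link chosen uniformly at random.
            nxt, previous = rng.choice(out_links), current
        current = nxt
    return path, visits
```

For example, `bookrank_walk({0: [1, 2], 1: [2], 2: [0]}, 1000)` returns a click sequence of length 1000 together with the agent's visit counts; running many such agents with different seeds gives a population whose clicks can be aggregated.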

3. EMPIRICAL DATA AND SIMULATIONS 

To test the performance of our model, we compare its re- 
sults (and those of PageRank) with empirical findings from 
a field study conducted on the main campus of Indiana Uni- 
versity. The navigation data were gathered from a dedicated 
server located in the central routing facility of the Bloom- 
ington campus. This system had a 1 Gbps Ethernet port 
that received a mirror of all outbound network traffic from
one of the undergraduate dormitories with more than 1000 
recurrent users. HTTP request data was collected from this 
network feed over a period of about two months, from March 
5 through May 3, 2008. A full analysis of the data set can be 
found elsewhere [11]. In our model we set $p_t = 0.15$, which
matches the empirical data as well as the value traditionally 
used for PageRank [2]. 

In our simulations, agents navigate scale-free networks
with $N$ nodes and degree distribution $P(k) \sim k^{-\gamma}$,
generated according to the growth model of Fortunato et al. [15].
We set $N = 6.3 \times 10^5$ and $\gamma = 2.1$ to match the subset
of the Web graph sampled in our data set. The functional form of
$P(R)$ and the exponent $\beta \approx 1.4$ are also obtained by
fitting the corresponding individual empirical data.

The simulation of both PageRank and BookRank models 
was repeated for a series of 250 different realizations of the 
network and performed using 1000 agents with $10^5$ clicks
allocated to each. 

3.1 Site Traffic 

Let us first analyze the aggregate distribution of traffic
received by sites. The traffic distribution $P(T)$ is displayed
in Fig. 1 (top) for the two models along with our observed
data. All distributions are well approximated by power laws
$P(T) \sim T^{-\alpha}$. The exponent predicted by PageRank is
$\alpha \approx 2.1$. Not surprisingly, this exponent matches that of
the degree distribution in the Web graph [3]. This happens
because random walkers visit the nodes of an undirected
network with a probability proportional to their degree. The
exponent changes very little when we introduce directed links
and a small teleportation probability, as in PageRank [?].
The PageRank exponent does not reproduce well that of
the empirical distribution, characterized by $\alpha \approx 1.75$ (see
guide for the eye in Fig. 1). This discrepancy, together with
the robustness of random walk exponents, indicates that the
empirical data reflect some other mechanism beyond
PageRank's uniform choice of the next destination. The BookRank
model, on the other hand, provides an excellent prediction of
the empirical $P(T)$.

Figure 2: Distributions of link traffic $\omega$ for the
empirical click stream, PageRank, and BookRank.

Fig. 1 (bottom) shows the distribution of traffic originating
from jumps, identified by HTTP requests with an empty
referrer. Once again PageRank's prediction is poor; PageRank
assumes a uniform probability $1/N$ of restarting from
any node. In contrast, BookRank selects the next site using
the bookmarks of each user; its prediction and the empirical
data display a power-law behavior with close exponents
$\alpha \approx 1.7$. Remarkably, following the same argument
developed by Fortunato et al. [15], the exponent can be derived
from $\beta$ via the relationship $\alpha = 1 + 1/\beta$. The fact
that the empirical exponents $\alpha$ and $\beta$ satisfy this
theoretical relation is a strong indication that our model's
dependence on the rank of bookmarks for choosing teleportation
targets quantitatively captures real user behavior. The exponent
$\beta \approx 1.4$ is also surprisingly close to the one empirically
observed in the distribution of click probability as a function
of the ranks of results returned by a search engine [5, 7]. This
provides further support for our hypothesis that rank-based choice
is a key and general cognitive mechanism underlying Web
search and browsing behavior.
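
The relation $\alpha = 1 + 1/\beta$ between the exponent of the jump traffic distribution and the exponent of the rank-based choice can be checked with the standard conversion from a rank law to a frequency distribution. The lines below are a heuristic sketch of that conversion, not necessarily the argument given by Fortunato et al. They assume that the jump traffic $T_0$ received by a bookmark is proportional to the probability of selecting its rank, and they introduce $R(T_0)$, the number of bookmarks with jump traffic of at least $T_0$, as an auxiliary quantity.

```latex
\begin{align}
  T_0(R) &\propto P(R) \propto R^{-\beta}
    && \text{(assumption: traffic follows the choice probability)} \\
  R(T_0) &\propto T_0^{-1/\beta}
    && \text{(invert the rank law)} \\
  P(T_0) &\propto \left| \frac{dR}{dT_0} \right| \propto T_0^{-(1 + 1/\beta)}
    && \Longrightarrow \quad \alpha = 1 + \frac{1}{\beta} \\
  \beta \approx 1.4 &\;\Longrightarrow\; \alpha \approx 1 + 0.71 = 1.71
    && \text{(compare with the measured } \alpha \approx 1.7\text{)}
\end{align}
```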

3.2 Link Traffic 

A second aggregate quantity of interest is the distribution
of traffic across links, $P(\omega)$, as shown in Fig. 2. PageRank
predicts a log-normal curve with well-defined mean and
variance. The data instead reveal a much wider power-law
distribution $P(\omega) \sim \omega^{-\delta}$ with $\delta \approx 1.9$
(see guide for the eye in Fig. 2). BookRank provides a much better
approximation to the statistics of the link traffic than does PageRank.
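
The three aggregate quantities discussed so far (site traffic, traffic originating from jumps, and link traffic) can all be tallied from the same click records. The sketch below assumes each click is available as a `(referrer, target)` pair of site identifiers, with an empty or missing referrer marking a jump, as in the empirical data. The function name `tally_traffic` and the record layout are hypothetical and chosen for illustration only.

```python
from collections import Counter


def tally_traffic(clicks):
    """Count site, jump, and link traffic from (referrer, target) pairs."""
    site_traffic = Counter()  # T: all visits received by each site
    jump_traffic = Counter()  # T_0: visits that arrive with an empty referrer
    link_traffic = Counter()  # omega: clicks on each (source, target) link
    for referrer, target in clicks:
        site_traffic[target] += 1
        if not referrer:      # None or "" is treated as a jump
            jump_traffic[target] += 1
        else:
            link_traffic[(referrer, target)] += 1
    return site_traffic, jump_traffic, link_traffic
```

The distributions $P(T)$ and $P(\omega)$ are then histograms of the values stored in these counters, typically built with logarithmic bins because the values span several orders of magnitude.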

3.3 Entropy distribution 

One could easily be led to believe that the broad distribu- 
tions characterizing aggregate user behavior are a reflection 
of the variability of traffic generated by single users, thus 
concluding that there is no such thing as a "typical" user 
from the point of view of traffic generated. The following 
analysis shows that this is not the case by better character- 
izing the navigation statistics of Web users. 

Figure 3: Distributions of Shannon entropy $S$ for the
empirical click stream, PageRank, and BookRank.

Figure 4: Distributions of time $\tau$ between consecutive
visits to the same site for the empirical click stream,
PageRank, and BookRank.

To measure the diversity of the behavior of a group of
users, we use Shannon information entropy
$S = -\sum_i p_i \log p_i$, where $p_i$ is the fraction of visits
of each particular user to site $i$. Given $n$ visits to a
collection of sites, the entropy will be at its minimum $S = 0$
when all visits are to a single site, while its maximum
$S = \log n$ is achieved when a single visit has been paid to
each of $n$ distinct sites. Entropy offers a
better probe into single user behavior than, say, the number
of distinct sites visited; two users who have visited the same
number of sites can have very different measures of entropy.
It is therefore tempting to interpret the entropy as a measure
of the information needed to describe the browsing pattern
of a single user.
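
As a concrete illustration of this definition, the entropy of a single user can be computed directly from the list of sites that user visited. The helper below is a minimal sketch; the name `shannon_entropy` is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math
from collections import Counter


def shannon_entropy(visited_sites):
    """Entropy S = -sum_i p_i log p_i of one user's visits (natural log)."""
    counts = Counter(visited_sites)
    n = sum(counts.values())
    # p_i = c / n is the fraction of the user's visits paid to site i.
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Two users with the same number of distinct sites can differ widely: visits `["a"] * 9 + ["b"]` give a much lower entropy than `["a"] * 5 + ["b"] * 5`, even though both users visited exactly two sites.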

In Fig. 3 we compare the distribution $P(S)$ across users
for the empirical data and the two models. PageRank yields
a very narrow distribution of entropy, centered around high
values, indicating a high degree of similarity among users'
browsing behavior. The empirical data exhibits a wider
distribution, centered around lower values. Given an equal
number of visits, a real user will discover fewer pages than a
PageRank surfer. The BookRank model predicts an entropy
distribution closer to the empirical one, with a strong
dependence on the parameter $p_b$ controlling the use of tabbed
browsing (or the back button).

If we think of a user's browsing session as a tree [11],
the quantity $p_b$ corresponds to the tree's branching ratio.
Empirically this can be estimated by averaging across users.
Since each user has multiple sessions, we can either
macro-average the ratios (yielding an estimate
$\langle p_b \rangle \approx 0.19 \pm 0.06$)
or micro-average them (yielding a consistent but broader
range $\langle p_b \rangle \approx 0.2 \pm 0.2$). In the simulations
of Fig. 3 we used values of $p_b$ in this range; higher values
shift the entropy distribution closer to the empirical one.

3.4 Times between visits 

Another way to measure the spread of a user's activity 
through the network is to sample the time required to return 
to a page after a previous visit. The distribution of return 
times $\tau$, measured in number of clicks, is shown in Fig. 4.
Once again, the prediction generated by BookRank is
significantly closer to the empirical $P(\tau)$. PageRank produces
a distribution with a smaller power-law exponent,
overestimating the time required for a user to return to a site. This
is a consequence of its agents' lack of internal state. Our 
model, by introducing memory in the form of bookmarks, 
clearly improves on this result. 
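
Return times measured in clicks can be extracted from a single user's click sequence in one pass. The following is a minimal sketch under the assumption that the sequence is an ordered list of site identifiers; the function name `return_times` is hypothetical.

```python
def return_times(path):
    """Number of clicks between consecutive visits to the same site."""
    last_seen = {}  # site -> index of its most recent visit
    times = []
    for step, site in enumerate(path):
        if site in last_seen:
            times.append(step - last_seen[site])
        last_seen[site] = step
    return times
```

For the sequence `["a", "b", "a", "a", "c", "b"]` the function returns `[2, 1, 4]`: site `a` is revisited after 2 clicks and then after 1 click, and site `b` after 4 clicks. Pooling these values over all users gives the distribution of return times.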

4. DISCUSSION 

BookRank is a non-Markovian (history dependent) model 
that better approximates aggregate Web traffic than does 
PageRank. It introduces memory through the bookmarks 
maintained by each user, from which he or she restarts 
browsing after each jump. It also allows backtracking, which 
can account for the branching behavior characteristic of multi- 
tabbed browsing and use of the back button. We have shown 
that these two processes yield more realistic estimates of the 
number of sites visited by each user and the return time to 
previously visited pages, as compared with PageRank. In- 
creasing the probability of branching, changing an otherwise 
linear browsing session into one involving multiple tabs, re- 
sults in entropy values closer to the empirical ones. Fur- 
thermore, ranking bookmarks by the number of past visits 
is sufficient to reproduce the node and link traffic distribu- 
tions. In summary, our model allows us to reproduce the 
intrinsic broadness of aggregate traffic measurements while 
better explaining the more narrowly focused browsing pat- 
terns of individual users. 

Although this model is a clear step in the right direc- 
tion, it shares some of the limitations present in previous 
models. The probability of ending a session is taken to be 
uniform, resulting in an exponential distribution of session 
lengths, while the empirical data obey a log-normal
distribution [11]. Additionally, the range of values for Shannon
entropy does not completely overlap with our empirical
observations. This may be addressed by enabling higher branching
in the model; users may open many tabs from a page, which 
is hard to capture with just a back button. 

In the current model, all users are stochastic replicas with 
identical parameters. An obvious way to extend the model
is to take into account the diversity of the users with
empirically driven distributions over their parameters (exponent
$\beta$ of the probability distribution over bookmarks, jump
probability $p_t$, back button usage $p_b$). Future work can also
explore node-dependent jump probabilities to model the
varying intrinsic relevance that users attribute to sites; for
example, search engines are likely bookmarks for most users.
Restrictions on the subset of nodes reachable by each user
can be used to model different areas of interest. Finally, not
all pages are equally likely to be the end point of a
browsing session (i.e., the origin of a jump); our model can be
extended to account for this heterogeneity.